
Adding Doc.content_hash and integrating for unique DocDetails.doc_id #1029


Open: wants to merge 7 commits into main

Conversation

jamesbraza
Collaborator

This PR:

  • Enables DocDetails.doc_id deduplication across files that share the same beginning (e.g. a main text and its supplemental information, whose metadata is inferred to be identical)
    • Adds a content_hash field on Doc (a value we already collect during Docs.aadd)
    • Expands DocDetails.doc_id into a composite key of DOI and content hash
    • Adds test coverage in test_duplicate for a case that fails on current main
  • Substantially documents DocDetails metadata upgrades and file_location

Closes #1005

@jamesbraza jamesbraza self-assigned this Jul 24, 2025
@Copilot Copilot AI review requested due to automatic review settings July 24, 2025 21:58
@jamesbraza jamesbraza added the bug Something isn't working label Jul 24, 2025
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. documentation Improvements or additions to documentation enhancement New feature or request labels Jul 24, 2025
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds content hashing capabilities to enable unique document identification for files that share metadata but have different content, such as main papers and their supplemental information. The implementation adds a content_hash field to the Doc class and integrates it into the DocDetails.doc_id generation to create composite keys based on both DOI and content hash.

  • Adds content_hash field to Doc class with auto-population from file contents
  • Updates DocDetails.doc_id to be a composite key of DOI and content hash via new compute_unique_doc_id function
  • Enhances test coverage with scenarios for duplicate detection across files with shared metadata but different content

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Summary per file:

  • tests/test_paperqa.py: Expands duplicate detection tests to validate content hash differentiation
  • src/paperqa/utils.py: Adds compute_unique_doc_id function for generating composite document IDs
  • src/paperqa/types.py: Adds content_hash field to Doc and updates DocDetails validation/merging logic
  • src/paperqa/llms.py: Updates Doc instantiation to include the content hash field
  • src/paperqa/docs.py: Integrates content hash computation during document addition
  • src/paperqa/clients/__init__.py: Updates the metadata client to handle the content hash field during doc upgrades
Comments suppressed due to low confidence (1)

tests/test_paperqa.py:1068

  • The test expects specific hardcoded dockey values, but these appear to be content hash-based values that could change if the document content or hashing logic changes. Consider using a more flexible assertion that validates the dockey format or length rather than exact values.
    assert doc_details.dockey in {"8ce7ddba9c9dcae6", "a353fa2478475c9c"}
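A shape-based assertion along the lines Copilot suggests could look like this (the helper name and the 16-hex-character format are assumptions based on the IDs quoted above):

```python
import re


def looks_like_dockey(value: str, length: int = 16) -> bool:
    # Validate the format and length of an ID rather than pinning exact
    # hash values that change whenever content or hashing logic changes.
    return re.fullmatch(rf"[0-9a-f]{{{length}}}", value) is not None
```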

if doi:
    value_to_encode: str = doi.lower() + (content_hash or "")
else:
    value_to_encode = content_hash or str(uuid4())

Copilot AI Jul 24, 2025


Using str(uuid4()) as a fallback when both DOI and content_hash are None could lead to non-deterministic document IDs for the same document across different runs. Consider if this is the intended behavior or if a more deterministic fallback would be appropriate.

Suggested change:
-    value_to_encode = content_hash or str(uuid4())
+    value_to_encode = content_hash or "default"


Collaborator Author


We recompute doc_id if there are changes in DocDetails.__add__, so I believe we're safe.

dockey_is_content_hash = False
if dockey is None:
    # md5 sum of file contents (not path!)
Collaborator Author


I moved this comment into Doc.content_hash's description, FYI.


# note we have some extra fields which may have come from reading the doc text,
# but aren't in the doc object, we add them here too.
extra_fields = {
Collaborator Author


The usage of "extra" here is overloaded:

  • "Extra" in a Pydantic context (and we have DocDetails.other, adding to the confusion)
  • "Extra" in a query context

Paired with vagueness such as "some extra fields", I decided to rename and reword this whole section.

), "Expected citation to be inferred"
assert shorter_flag_day.content_hash
assert flag_day.content_hash != shorter_flag_day.content_hash
assert flag_day.doc_id != shorter_flag_day.doc_id
Collaborator Author


This whole PR basically enables this singular assertion to pass

-elif "doc_id" not in data or not data["doc_id"]:  # keep user defined doc_ids
-    data["doc_id"] = encode_id(uuid4())
+if not data.get("doc_id"):  # keep user defined doc_ids
+    data["doc_id"] = compute_unique_doc_id(doi, data.get("content_hash"))
Collaborator


we talked about this in person, but let's move this into an optional feature

Collaborator Author


I piloted this some more today, by the way. Moving compute_unique_doc_id into something like Settings.details_id_factory would require passing a throwaway Settings reference to all DocDetails constructions, which is both awkward and error-prone.

Perhaps a less awkward route is a boolean environment variable PQA_USE_OLD_ID_FACTORY that gets read inside compute_unique_doc_id.
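The environment-variable toggle could read like this; the flag name comes from the comment above and is not part of paperqa today:

```python
import os


def use_old_id_factory() -> bool:
    # Hypothetical opt-out flag proposed in this thread; reading it lazily
    # (per call) avoids threading a Settings object into DocDetails.
    return os.environ.get("PQA_USE_OLD_ID_FACTORY", "").strip().lower() in {"1", "true", "yes"}
```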

Collaborator

@mskarlin mskarlin left a comment


I need to read this in more detail and think about it

Labels
  • bug: Something isn't working
  • documentation: Improvements or additions to documentation
  • enhancement: New feature or request
  • size:L: This PR changes 100-499 lines, ignoring generated files
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dockey conflict because of same DOI
2 participants